Project 1: Exploring the differences between schools of philosophy

Jiapeng Xu JX2427 Date: 09/20/2022

For most people, philosophy may be obscure and confusing. It usually takes a lot of time to grasp the main ideas of the different schools of philosophy and how they relate to one another. In this project, we hand part of this problem over to the computer and approach it with NLP and machine learning. First, exploratory data analysis was conducted to visualize the dataset and surface insights hidden in it. Then, based on a matrix of token counts, the schools of philosophy were clustered to understand how similar they are to one another. In the last step, a classifier was trained so that our algorithm can predict the corresponding school from the text of a sentence alone.

Step1: Data harvest: read the CSV file and check the basic information of the dataset
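
A minimal sketch of this step with pandas. The file name and column names are not stated in this report, so the inline sample, the column names (`title`, `author`, `school`, `sentence_str`, `original_publication_date`), and the use of negative years for BC dates are all assumptions made for illustration only.

```python
import io

import pandas as pd

# Hypothetical stand-in for the project's CSV file. With the real data this
# would be pd.read_csv(<path to the csv file>) instead of an in-memory string.
sample_csv = io.StringIO(
    "title,author,school,sentence_str,original_publication_date\n"
    "Plato - Complete Works,Plato,plato,First placeholder sentence.,-350\n"
    "Aristotle - Complete Works,Aristotle,aristotle,Second placeholder sentence.,-320\n"
    "Aristotle - Complete Works,Aristotle,aristotle,Third placeholder sentence.,-320\n"
)

df = pd.read_csv(sample_csv)

# Basic information: size, column types, missing values, first rows.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())
print(df.head())

# Number of rows (sentences) per school, the quantity behind the count plots.
school_counts = df["school"].value_counts()
print(school_counts)
```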

Step2: Exploratory data analysis

This plot shows that 'Aristotle - Complete Works' and 'Plato - Complete Works' contribute the most sentences of all titles.

Similarly, among all authors, 'Aristotle' and 'Plato' have the most sentences.

This diagram shows that 'stoicism' has the fewest sentences of all schools.

This table and diagram show the distribution of sentence length. Most sentence lengths fall between 0 and 500. The mean and standard deviation of sentence length are 151 and 105, respectively.
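
As a small worked example of these summary statistics, the sketch below computes the mean and standard deviation of sentence length for three made-up sentences. It assumes that length is measured in characters and that the sample standard deviation (n - 1 denominator) is used; neither is stated above.

```python
import statistics

# Made-up sentences, not taken from the dataset.
sentences = [
    "Virtue is knowledge.",
    "The unexamined life is not worth living.",
    "Man is by nature a political animal.",
]

# Assumption: sentence length = number of characters.
lengths = [len(s) for s in sentences]

mean_length = statistics.mean(lengths)
std_length = statistics.stdev(lengths)  # sample standard deviation (n - 1)

print(lengths, mean_length, round(std_length, 2))
```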

According to the above diagram, most original publication dates are between 1500 and 2000. A small proportion date to around 350 BC.

According to the above three boxplots, sentences in 'Discourse on Method' and 'Second Treatise on Government' tend to be longer. All the other titles, authors, and schools tend to express their ideas in sentences of similar length.

From this boxplot, we find that 'aristotle', 'plato', and 'stoicism' are much older than the rest. All the others fall within a similar period.

This table lists the exact mean sentence length and original publication date for each school, author, and title. It leads to insights similar to those from the boxplots.
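
A table of this kind can be sketched with a pandas group-by. The column names `school`, `sentence_length`, and `original_publication_date` and all values below are made up for illustration; grouping by author or title would work the same way with the corresponding column.

```python
import pandas as pd

# Toy data with hypothetical column names and values.
df = pd.DataFrame(
    {
        "school": ["plato", "plato", "aristotle", "stoicism"],
        "sentence_length": [100, 140, 160, 90],
        "original_publication_date": [-350, -350, -320, 170],
    }
)

# Mean sentence length and publication date per school, longest first.
summary = (
    df.groupby("school")[["sentence_length", "original_publication_date"]]
    .mean()
    .sort_values("sentence_length", ascending=False)
)
print(summary)
```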

Word clouds for each school of philosophy are plotted above, from which we can see the word patterns that distinguish the schools.

Step3: Clustering of schools

Hierarchical clustering was conducted in this step. According to the dendrogram, 'stoicism' and 'nietzsche' are the most similar schools, followed by 'communism' and 'capitalism'.
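
The idea can be sketched as follows: build one token-count vector per school, then run agglomerative clustering on those vectors. The linkage method (`average`), the distance metric (`cosine`), the school names, and the texts below are all assumptions chosen for illustration; the settings actually used for the dendrogram are not stated here.

```python
from scipy.cluster.hierarchy import linkage
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins: one concatenated text per hypothetical school.
school_texts = {
    "school_a": "virtue reason nature virtue fate",
    "school_b": "virtue reason nature will power",
    "school_c": "labor capital market price wage",
    "school_d": "labor capital market class wage",
}
names = list(school_texts)

# Matrix of token counts: one row per school, one column per word.
X = CountVectorizer().fit_transform(list(school_texts.values())).toarray()

# Agglomerative clustering; each row of Z records one merge:
# [cluster index 1, cluster index 2, distance, size of the new cluster].
Z = linkage(X, method="average", metric="cosine")

# The first merge joins the two most similar schools.
first_pair = sorted(names[int(i)] for i in Z[0, :2])
print(first_pair)
print(round(float(Z[0, 2]), 3))
```

Passing `Z` and `labels=names` to `scipy.cluster.hierarchy.dendrogram` would draw the tree (this requires matplotlib). The pair that merges at the lowest height is the most similar one.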

Step4: Sentiment analysis

In the last step, all sentence strings were converted into a matrix of token counts, on which a logistic regression classifier was fitted. On the test set, the model's accuracy is about 76%. Then 10 sentences were randomly selected to check how the model behaves on individual examples. Among these 10 samples, only two sentences were predicted incorrectly. According to the clustering in Step3, 'analytic' and 'phenomenology' are very similar. Therefore, for the 139750th sample, the wrong prediction is relatively understandable.
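
A minimal sketch of this kind of classifier with scikit-learn, assuming default `CountVectorizer` settings and a `LogisticRegression` with a raised iteration limit. The sentences and school labels are made up, so the sketch only illustrates the pipeline and does not reproduce the 76% accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training sentences with hypothetical school labels.
train_sentences = [
    "the soul seeks virtue and justice",
    "virtue and justice guide the soul",
    "the market sets wages and prices",
    "wages and prices follow the market",
]
train_schools = ["school_a", "school_a", "school_b", "school_b"]

# Sentences -> token-count matrix -> logistic regression.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_sentences, train_schools)

# Predict the school of unseen sentences from their text alone.
new_sentences = ["justice for the soul", "prices in the market"]
true_schools = ["school_a", "school_b"]
predicted = model.predict(new_sentences)
accuracy = model.score(new_sentences, true_schools)

print(list(predicted), accuracy)
```

Words that never appeared in training (here "for" and "in") are simply ignored by the fitted vectorizer, so a prediction relies only on the known vocabulary.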